The goal of this project to apply machine learning techniques to electronic medical health records of patients with heart failure condition and then find the best model for such prediction task. This work is also intended to replicate the paper published in February 2020 from Chicoo & Jurman which was mostly done in R notebook.
The dataset used in this project is publicly available here under the Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Attribute Information:
Thirteen (13) clinical features:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from sklearn import model_selection
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split, cross_validate, KFold, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn import preprocessing
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, roc_auc_score
import shap
from xgboost import XGBClassifier
import warnings
warnings.filterwarnings("ignore", category=UserWarning)
data = pd.read_csv('heart_failure_clinical_records_dataset.csv')
data.head()
| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75.0 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 |
| 1 | 55.0 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 |
| 2 | 65.0 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 |
| 3 | 50.0 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 |
| 4 | 65.0 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 |
print(f"We have a total of {data.shape[0]} patients")
We have a total of 299 patients
data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 299 entries, 0 to 298 Data columns (total 13 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 age 299 non-null float64 1 anaemia 299 non-null int64 2 creatinine_phosphokinase 299 non-null int64 3 diabetes 299 non-null int64 4 ejection_fraction 299 non-null int64 5 high_blood_pressure 299 non-null int64 6 platelets 299 non-null float64 7 serum_creatinine 299 non-null float64 8 serum_sodium 299 non-null int64 9 sex 299 non-null int64 10 smoking 299 non-null int64 11 time 299 non-null int64 12 DEATH_EVENT 299 non-null int64 dtypes: float64(3), int64(10) memory usage: 30.5 KB
data.isnull().sum()
age 0 anaemia 0 creatinine_phosphokinase 0 diabetes 0 ejection_fraction 0 high_blood_pressure 0 platelets 0 serum_creatinine 0 serum_sodium 0 sex 0 smoking 0 time 0 DEATH_EVENT 0 dtype: int64
Looks like we do not have missing values. But the data type for 'age' is float64 so we will need to change the float data type to integer.
data['age'] = data['age'].apply(np.int64)
#Only use this if wanna reset seaborn theme
#matplotlib.rc_file_defaults()
#sns.reset_orig
#sns.set_palette("Accent_r")
sns.pairplot(data, hue='DEATH_EVENT', palette='viridis')
<seaborn.axisgrid.PairGrid at 0x12a338ac0>
def category_count(data:pd.DataFrame, col:str):
assert type(data) == pd.DataFrame
assert type(col) == str
zero = one = 0
one = [one+1 for i in data[col] if i == 1]
zero = [zero+1 for i in data[col] if i == 0]
return sum(zero), sum(one)
# Create subplots: use 'domain' type for Pie subplot
fig = make_subplots(rows=2, cols=3, specs=[[{'type':'domain'}, {'type':'domain'},{'type':'domain'}],\
[{'type':'domain'}, {'type':'domain'},{'type':'domain'}]])
labels = [1, 0]
hole=0
gender_vals = list(category_count(data, 'sex'))
death_vals = list(category_count(data, 'DEATH_EVENT'))
anaemia_vals = list(category_count(data, 'anaemia'))
high_blood_pressure_vals = list(category_count(data, 'high_blood_pressure'))
diabetes_vals = list(category_count(data, 'diabetes'))
smoking_vals = list(category_count(data, 'smoking'))
fig.add_trace(go.Pie(labels=labels, values=gender_vals, name="Sex", hole=hole),
1, 1)
fig.add_trace(go.Pie(labels=labels, values=death_vals, name="Deaths", hole=hole),
1, 2)
fig.add_trace(go.Pie(labels=labels, values=anaemia_vals, name="Anaemia", hole=hole),
1, 3)
fig.add_trace(go.Pie(labels=labels, values=high_blood_pressure_vals, name="high_blood_pressure", hole=hole),
2, 1)
fig.add_trace(go.Pie(labels=labels, values=diabetes_vals, name="diabetes", hole=hole),
2, 2)
fig.add_trace(go.Pie(labels=labels, values=smoking_vals, name="smoking", hole=hole),
2, 3)
fig.update_layout(
title_text="Categorical features",
annotations=[dict(text='Sex', x=0.12, y=1.1, font_size=15, showarrow=False),
dict(text='Deaths', x=0.50, y=1.1, font_size=15, showarrow=False),
dict(text='Anaemia', x=0.895, y=1.1, font_size=15, showarrow=False),
dict(text='High blood pressure', x=0.07, y=0.5, font_size=15, showarrow=False),
dict(text='Diabetes', x=0.50, y=0.5, font_size=15, showarrow=False),
dict(text='Smoking', x=0.895, y=0.5, font_size=15, showarrow=False)])
fig.show(renderer='png')
According to the dataset owner, the patients consist of 105 female and 194 male. This correspond with the following label 0 for female and 1 for male.
#plt.figure(figsize=(9, 5))
fig, axes = plt.subplots(3, 2, figsize=(12, 6))
fig.suptitle('Numerical features')
sns.histplot(data, ax=axes[0,0], x='serum_creatinine', kde=True, palette='viridis')
sns.histplot(data, ax=axes[0,1], x='ejection_fraction', kde=True, palette='viridis')
sns.histplot(data, ax=axes[1,0], x='serum_sodium', kde=True, palette='viridis')
sns.histplot(data, ax=axes[1,1], x='creatinine_phosphokinase', kde=True, palette='viridis')
sns.histplot(data, ax=axes[2,0], x='platelets', kde=True, palette='viridis')
sns.histplot(data, ax=axes[2,1], x='time', kde=True, palette='viridis')
plt.tight_layout()
plt.show()
fig, axes = plt.subplots(3, 2, figsize=(12, 6))
fig.suptitle('Numerical features by survival')
sns.histplot(data, ax=axes[0,0], x='serum_creatinine', hue='DEATH_EVENT', kde=True, palette='viridis')
sns.histplot(data, ax=axes[0,1], x='ejection_fraction', hue='DEATH_EVENT', kde=True, palette='viridis')
sns.histplot(data, ax=axes[1,0], x='serum_sodium', hue='DEATH_EVENT', kde=True, palette='viridis')
sns.histplot(data, ax=axes[1,1], x='creatinine_phosphokinase', hue='DEATH_EVENT', kde=True, bins=100, palette='viridis')
sns.histplot(data, ax=axes[2,0], x='platelets', hue='DEATH_EVENT', kde=True, palette='viridis')
sns.histplot(data, ax=axes[2,1], x='time', hue='DEATH_EVENT', kde=True, bins=30, palette='viridis')
plt.tight_layout()
plt.show()
#plt.rcParams["patch.force_edgecolor"] = True
sns.set_style('whitegrid')
sns.set_palette='flare'
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
fig.suptitle('Numerical features by gender')
sns.boxplot(ax=axes[0, 0], data=data, x='sex', y='platelets', palette='viridis')
sns.boxplot(ax=axes[0, 1], data=data, x='sex', y='serum_creatinine', palette='viridis')
sns.boxplot(ax=axes[0, 2], data=data, x='sex', y='ejection_fraction', palette='viridis')
sns.boxplot(ax=axes[1, 0], data=data, x='sex', y='serum_sodium', palette='viridis')
sns.boxplot(ax=axes[1, 1], data=data, x='sex', y='creatinine_phosphokinase', palette='viridis')
fig.tight_layout()
fig, axes = plt.subplots(1, 4, figsize=(18, 4))
sns.scatterplot(ax=axes[0], data=data, x='age', y='serum_creatinine', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1], data=data, x='age', y='serum_sodium', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[2], data=data, x='age', y='creatinine_phosphokinase', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[3], data=data, x='age', y='ejection_fraction', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
fig.tight_layout()
fig, axes = plt.subplots(2, 3, figsize=(18, 8))
sns.scatterplot(ax=axes[0,0], data=data, x='serum_creatinine', y='serum_sodium', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[0,1], data=data, x='serum_creatinine', y='creatinine_phosphokinase', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[0,2], data=data, x='serum_creatinine', y='ejection_fraction', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1,0], data=data, x='serum_creatinine', y='age', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1,1], data=data, x='serum_creatinine', y='platelets', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
fig.tight_layout()
fig, axes = plt.subplots(2, 3, figsize=(18, 8))
sns.scatterplot(ax=axes[0,0], data=data, x='time', y='serum_sodium', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[0,1], data=data, x='time', y='creatinine_phosphokinase', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[0,2], data=data, x='time', y='ejection_fraction', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1,0], data=data, x='time', y='age', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1,1], data=data, x='time', y='platelets', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
sns.scatterplot(ax=axes[1,2], data=data, x='time', y='serum_creatinine', hue='DEATH_EVENT', alpha=0.8, palette='viridis')
fig.tight_layout()
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
fig.suptitle('Categorical features by death events')
sns.violinplot(ax=axes[0,0], data=data, x='DEATH_EVENT', y='age', hue='sex', palette='viridis', split=True)
sns.violinplot(ax=axes[0,1], data=data, x='DEATH_EVENT', y='smoking', hue='sex', palette='viridis', split=True)
sns.violinplot(ax=axes[0,2], data=data, x='DEATH_EVENT', y='anaemia', hue='sex', palette='viridis', split=True)
sns.violinplot(ax=axes[1,0], data=data, x='DEATH_EVENT', y='diabetes', hue='sex', palette='viridis', split=True)
sns.violinplot(ax=axes[1,1], data=data, x='DEATH_EVENT', y='high_blood_pressure', hue='sex', palette='viridis', split=True)
fig.tight_layout()
fig, axes = plt.subplots(2, 3, figsize=(14, 6))
fig.suptitle('Numerical features by gender')
sns.violinplot(ax=axes[0, 0], data=data, x='sex', y='platelets', palette='flare', split=True)
sns.violinplot(ax=axes[0, 1], data=data, x='sex', y='serum_creatinine', palette='flare', split=True)
sns.violinplot(ax=axes[0, 2], data=data, x='sex', y='ejection_fraction', palette='flare', split=True)
sns.violinplot(ax=axes[1, 0], data=data, x='sex', y='serum_sodium', palette='flare', split=True)
sns.violinplot(ax=axes[1, 1], data=data, x='sex', y='creatinine_phosphokinase', palette='flare', split=True)
fig.tight_layout()
data['time'].describe()
count 299.000000 mean 130.260870 std 77.614208 min 4.000000 25% 73.000000 50% 115.000000 75% 203.000000 max 285.000000 Name: time, dtype: float64
survived = pd.DataFrame(data.loc[(data['DEATH_EVENT'] == 0)])
died = pd.DataFrame(data.loc[(data['DEATH_EVENT'] == 1)])
print(f"There are {survived.shape[0]} patients who survived")
print(f"There are {died.shape[0]} patients who died")
There are 203 patients who survived There are 96 patients who died
def bin_time(data):
one = two = three = four = five \
= six = seven = eight = nine = ten = 0
for t in data['time']:
if t < 30:
one+=1
if t > 31 and t <=60:
two+=1
if t > 61 and t <=90:
three+=1
if t > 91 and t <=120:
four+=1
if t > 121 and t <=150:
five+=1
if t > 151 and t <=180:
six+=1
if t > 181 and t <=210:
seven+=1
if t > 211 and t <=240:
eight+=1
if t > 241 and t <=285:
nine+=1
return one, two, three, four, five, six, seven, eight, nine
month_range_labels = ['0-30','31-60', '61-90', '91-120', '121-150', '151-180', '211-240', '241-285']
#We create a list of month range based on survivors and the deceased
mr_both = list(bin_time(data))
mr_died = list(bin_time(died))
mr_survived = list(bin_time(survived))
mr_died = list(bin_time(died))
#Check the values
print(f'Both survived and deceased:{mr_both}')
print(f'Survived:{mr_survived}')
print(f'Died: {mr_died}')
Both survived and deceased:[35, 22, 50, 42, 20, 14, 43, 26, 32] Survived:[4, 4, 36, 35, 15, 6, 39, 24, 32] Died: [31, 18, 14, 7, 5, 8, 4, 2, 0]
#Then we couple the month range labels with respective list of survivors and deceased
zip_mr_both_data = list(zip(month_range_labels, mr_both))
zip_mr_survive_data = list(zip(month_range_labels, mr_survived))
zip_mr_died_data = list(zip(month_range_labels, mr_died))
month_range_both_df = pd.DataFrame(zip_mr_both_data, columns=['Month range', 'total in range'])
month_range_survived_df = pd.DataFrame(zip_mr_survive_data, columns=['Month range', 'total in range'])
month_range_died_df = pd.DataFrame(zip_mr_died_data, columns=['Month range', 'total in range'])
month_range_both_df.head(10), month_range_survived_df.head(10)
( Month range total in range 0 0-30 35 1 31-60 22 2 61-90 50 3 91-120 42 4 121-150 20 5 151-180 14 6 211-240 43 7 241-285 26, Month range total in range 0 0-30 4 1 31-60 4 2 61-90 36 3 91-120 35 4 121-150 15 5 151-180 6 6 211-240 39 7 241-285 24)
month_range_survived_df.head(10)
| Month range | total in range | |
|---|---|---|
| 0 | 0-30 | 4 |
| 1 | 31-60 | 4 |
| 2 | 61-90 | 36 |
| 3 | 91-120 | 35 |
| 4 | 121-150 | 15 |
| 5 | 151-180 | 6 |
| 6 | 211-240 | 39 |
| 7 | 241-285 | 24 |
month_range_died_df.head(10)
| Month range | total in range | |
|---|---|---|
| 0 | 0-30 | 31 |
| 1 | 31-60 | 18 |
| 2 | 61-90 | 14 |
| 3 | 91-120 | 7 |
| 4 | 121-150 | 5 |
| 5 | 151-180 | 8 |
| 6 | 211-240 | 4 |
| 7 | 241-285 | 2 |
sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 3, figsize=(20, 6), constrained_layout=True)
sns.barplot(ax=axes[0], data=month_range_both_df, x='Month range', y='total in range', palette='copper')
sns.barplot(ax=axes[1], data=month_range_survived_df, x='Month range', y='total in range', palette='viridis')
sns.barplot(ax=axes[2], data=month_range_died_df, x='Month range', y='total in range', palette='flare')
axes[0].set_title("Month range for both survivors and deceased")
axes[1].set_title("Month range for survivors")
axes[2].set_title("Month for deceased")
fig.tight_layout()
conditions = [
(data['time'] < 30),
(data['time'] > 30) & (data['time'] <=60),
(data['time'] > 60) & (data['time'] <=90),
(data['time'] > 90) & (data['time'] <=120),
(data['time'] > 120) & (data['time'] <=150),
(data['time'] > 150) & (data['time'] <=180),
(data['time'] > 180) & (data['time'] <=210),
(data['time'] > 210) & (data['time'] <=240),
(data['time'] > 240) & (data['time'] <=270),
(data['time'] > 270)
]
time_index = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
data['day_range'] = np.select(conditions, time_index)
#Verify if the range is correct
data.head(300)
| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | day_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 | 0 |
| 1 | 55 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 | 0 |
| 2 | 65 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 | 0 |
| 3 | 50 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 | 0 |
| 4 | 65 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 294 | 62 | 0 | 61 | 1 | 38 | 1 | 155000.00 | 1.1 | 143 | 1 | 1 | 270 | 0 | 8 |
| 295 | 55 | 0 | 1820 | 0 | 38 | 0 | 270000.00 | 1.2 | 139 | 0 | 0 | 271 | 0 | 9 |
| 296 | 45 | 0 | 2060 | 1 | 60 | 0 | 742000.00 | 0.8 | 138 | 0 | 0 | 278 | 0 | 9 |
| 297 | 45 | 0 | 2413 | 0 | 38 | 0 | 140000.00 | 1.4 | 140 | 1 | 1 | 280 | 0 | 9 |
| 298 | 50 | 0 | 196 | 0 | 45 | 0 | 395000.00 | 1.6 | 136 | 1 | 1 | 285 | 0 | 9 |
299 rows × 14 columns
plt.figure(figsize=(8,6))
sns.histplot(data, x='day_range', hue='DEATH_EVENT', kde=True, bins=10)
plt.tight_layout()
plt.show()
data.head()
| age | anaemia | creatinine_phosphokinase | diabetes | ejection_fraction | high_blood_pressure | platelets | serum_creatinine | serum_sodium | sex | smoking | time | DEATH_EVENT | day_range | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 75 | 0 | 582 | 0 | 20 | 1 | 265000.00 | 1.9 | 130 | 1 | 0 | 4 | 1 | 0 |
| 1 | 55 | 0 | 7861 | 0 | 38 | 0 | 263358.03 | 1.1 | 136 | 1 | 0 | 6 | 1 | 0 |
| 2 | 65 | 0 | 146 | 0 | 20 | 0 | 162000.00 | 1.3 | 129 | 1 | 1 | 7 | 1 | 0 |
| 3 | 50 | 1 | 111 | 0 | 20 | 0 | 210000.00 | 1.9 | 137 | 1 | 0 | 7 | 1 | 0 |
| 4 | 65 | 1 | 160 | 1 | 20 | 0 | 327000.00 | 2.7 | 116 | 0 | 0 | 8 | 1 | 0 |
feats = ['age', 'anaemia', 'creatinine_phosphokinase', 'diabetes',\
'serum_creatinine', 'ejection_fraction', 'serum_sodium', 'sex', 'smoking']
X = pd.DataFrame(data, columns= feats)
y = pd.DataFrame(data, columns=['DEATH_EVENT'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
RandomForestClassifier()
rfc.feature_importances_
array([0.17512247, 0.02213126, 0.15813011, 0.02768513, 0.25887082,
0.19399937, 0.11526326, 0.02939665, 0.01940094])
plt.figure(figsize=(8,6))
plt.barh(feats, rfc.feature_importances_)
plt.tight_layout()
plt.show()
explainer2 = shap.TreeExplainer(rfc)
shap_values2 = explainer2.shap_values(X_test)
plt.figure(figsize=(15,5))
shap.summary_plot(shap_values2[1], X_test, plot_type="bar")
shap.summary_plot(shap_values2[1], X_test)
plt.tight_layout()
plt.show()
<Figure size 432x288 with 0 Axes>
xgbc = XGBClassifier()
xgbc.fit(X_train, y_train)
[15:12:15] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
print(xgbc.feature_importances_)
[0.12243439 0.04476107 0.08592099 0.07200531 0.21268545 0.14464846 0.08809499 0.11469523 0.11475415]
plt.figure(figsize=(15,5))
plt.barh(feats, xgbc.feature_importances_)
plt.show()
For the regression model, we will be experimenting against Random Forest Classifier, eXtreme Gradient Boosting, Voting Classifier, Näive Bayes Classifier and Decision Tree Classifier.
feats1 = ['serum_creatinine', 'ejection_fraction', 'age']
feats2 = ['serum_creatinine', 'ejection_fraction']
feats3 =['serum_creatinine', 'ejection_fraction', 'age', 'creatinine_phosphokinase']
feats4 =['serum_creatinine', 'ejection_fraction', 'day_range']
feats5 =['serum_creatinine', 'ejection_fraction', 'day_range', 'creatinine_phosphokinase']
X = pd.DataFrame(data, columns= feats4)
y = pd.DataFrame(data, columns=['DEATH_EVENT'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
from sklearn.preprocessing import StandardScaler, MinMaxScaler
#sc = StandardScaler()
#X_train = sc.fit_transform(X_train)
#X_test = sc.transform(X_test)
mmsc = MinMaxScaler()
X_train = mmsc.fit_transform(X_train)
X_test = mmsc.transform(X_test)
rfc = RandomForestClassifier(n_estimators=100)
rfc.fit(X_train, y_train)
RandomForestClassifier()
from sklearn.feature_selection import SelectFromModel
model = SelectFromModel(rfc, prefit=True)
train_new = model.transform(X_train)
parameters = {
'n_estimators' : [320,330,340],
'max_depth' : [8, 9, 10, 11, 12],
'random_state' : [0],
#'max_features': ['auto'],
#'criterion' :['gini']
}
from sklearn.model_selection import GridSearchCV
gscv = GridSearchCV(RandomForestClassifier(), parameters, cv=10, n_jobs=-1)
gscv.fit(train_new, y_train)
GridSearchCV(cv=10, estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'max_depth': [8, 9, 10, 11, 12],
'n_estimators': [320, 330, 340], 'random_state': [0]})
print(gscv.score(train_new, y_train))
print(gscv.best_params_)
0.8660287081339713
{'max_depth': 8, 'n_estimators': 320, 'random_state': 0}
rfc_gs=RandomForestClassifier(random_state=0, n_estimators= 320, max_depth=8)
rfc_gs.fit(X_train, y_train)
RandomForestClassifier(max_depth=8, n_estimators=320, random_state=0)
rfcgs_pred = rfc_gs.predict(X_test)
print(confusion_matrix(y_test,rfcgs_pred))
print(classification_report(y_test,rfcgs_pred))
print(accuracy_score(y_test, rfcgs_pred))
print(round(roc_auc_score(y_test, rfcgs_pred),2))
[[47 6]
[12 25]]
precision recall f1-score support
0 0.80 0.89 0.84 53
1 0.81 0.68 0.74 37
accuracy 0.80 90
macro avg 0.80 0.78 0.79 90
weighted avg 0.80 0.80 0.80 90
0.8
0.78
tn, fp, fn, tp = confusion_matrix(y_test, rfcgs_pred).ravel()
specificity = tn / (tn+fp)
sensitivity = tp / (tp+fn)
round(specificity, 2)
0.89
round(sensitivity,2)
0.68
cfm = confusion_matrix(y_test,rfcgs_pred)
cfm_df = pd.DataFrame(cfm, columns=np.unique(y_test), index = np.unique(y_test))
cfm_df.index.name = 'Actual'
cfm_df.columns.name = 'Predicted'
plt.figure(figsize = (6,4))
sns.heatmap(cfm_df, cmap="YlGnBu", annot=True,annot_kws={"size": 16})
<AxesSubplot:xlabel='Predicted', ylabel='Actual'>
from sklearn.neighbors import KNeighborsClassifier
grid_params = {
'n_neighbors': [5,6,7,8,9,10,11,12,13,14,15,16],
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan']
}
knngscv = GridSearchCV(KNeighborsClassifier(), grid_params, cv=3, n_jobs=-1)
knngscv.fit(X_train, y_train)
GridSearchCV(cv=3, estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'metric': ['euclidean', 'manhattan'],
'n_neighbors': [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16],
'weights': ['uniform', 'distance']})
knngscv.best_params_
{'metric': 'euclidean', 'n_neighbors': 5, 'weights': 'distance'}
knn = KNeighborsClassifier(n_neighbors=10, weights='distance')
knn.fit(X_train, y_train)
KNeighborsClassifier(n_neighbors=10, weights='distance')
knn_pred = knn.predict(X_test)
print(classification_report(y_test,knn_pred))
print(accuracy_score(y_test, knn_pred))
precision recall f1-score support
0 0.75 0.89 0.81 53
1 0.78 0.57 0.66 37
accuracy 0.76 90
macro avg 0.76 0.73 0.73 90
weighted avg 0.76 0.76 0.75 90
0.7555555555555555
cfm_knn = confusion_matrix(y_test,knn_pred)
cfmknn_df = pd.DataFrame(cfm_knn, columns=np.unique(y_test), index = np.unique(y_test))
cfmknn_df.index.name = 'Actual'
cfmknn_df.columns.name = 'Predicted'pla
plt.figure(figsize = (6,4))
sns.heatmap(cfmknn_df, cmap="YlGnBu", annot=True,annot_kws={"size": 16})
plt.show()
from sklearn.ensemble import VotingClassifier
estimator = []
estimator.append(('LR', LogisticRegression()))
estimator.append(('XGBC', XGBClassifier()))
estimator.append(('RFC', RandomForestClassifier()))
vch = VotingClassifier(estimators = estimator, voting ='hard')
vch.fit(X_train, y_train)
[15:13:15] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
VotingClassifier(estimators=[('LR', LogisticRegression()),
('XGBC',
XGBClassifier(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None,
scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None,
verbosity=None)),
('RFC', RandomForestClassifier())])
vcpred = vch.predict(X_test)
print(classification_report(y_test,vcpred))
print(accuracy_score(y_test, vcpred))
precision recall f1-score support
0 0.80 0.91 0.85 53
1 0.83 0.68 0.75 37
accuracy 0.81 90
macro avg 0.82 0.79 0.80 90
weighted avg 0.81 0.81 0.81 90
0.8111111111111111
vcs = VotingClassifier(estimators = estimator, voting ='soft')
vcs.fit(X_train, y_train)
[15:13:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
VotingClassifier(estimators=[('LR', LogisticRegression()),
('XGBC',
XGBClassifier(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None, gamma=None,
gpu_id=None, importance_type='gain',
interaction_constraints=None,
learning_rate=None,
max_delta_step=None, max_depth=None,
min_child_weight=None, missing=nan,
monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None,
random_state=None, reg_alpha=None,
reg_lambda=None,
scale_pos_weight=None,
subsample=None, tree_method=None,
validate_parameters=None,
verbosity=None)),
('RFC', RandomForestClassifier())],
voting='soft')
vcspred = vcs.predict(X_test)
print(classification_report(y_test,vcspred))
print(accuracy_score(y_test, vcspred))
print(round(roc_auc_score(y_test, vcspred),2))
precision recall f1-score support
0 0.84 0.89 0.86 53
1 0.82 0.76 0.79 37
accuracy 0.83 90
macro avg 0.83 0.82 0.83 90
weighted avg 0.83 0.83 0.83 90
0.8333333333333334
0.82
tn, fp, fn, tp = confusion_matrix(y_test, vcspred).ravel()
specificity = tn / (tn+fp)
sensitivity = tp / (tp+fn)
round(specificity,2)
0.89
round(sensitivity,2)
0.68
cfm_vcs = confusion_matrix(y_test,vcspred)
cfmvcs_df = pd.DataFrame(cfm_vcs, columns=np.unique(y_test), index = np.unique(y_test))
cfmvcs_df.index.name = 'Actual'
cfmvcs_df.columns.name = 'Predicted'
plt.figure(figsize = (6,4))
sns.heatmap(cfmvcs_df, cmap="YlGnBu", annot=True,annot_kws={"size": 16})
plt.show()
xgbc2 = XGBClassifier()
xgbc2.fit(X_train, y_train)
[15:13:16] WARNING: /Users/travis/build/dmlc/xgboost/src/learner.cc:1095: Starting in XGBoost 1.3.0, the default evaluation metric used with the objective 'binary:logistic' was changed from 'error' to 'logloss'. Explicitly set eval_metric if you'd like to restore the old behavior.
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
importance_type='gain', interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=6,
min_child_weight=1, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=8, num_parallel_tree=1, random_state=0,
reg_alpha=0, reg_lambda=1, scale_pos_weight=1, subsample=1,
tree_method='exact', validate_parameters=1, verbosity=None)
xgbc2_pred = xgbc2.predict(X_test)
print(confusion_matrix(y_test,xgbc2_pred))
print(classification_report(y_test,xgbc2_pred))
print(accuracy_score(y_test, xgbc2_pred))
[[43 10]
[ 9 28]]
precision recall f1-score support
0 0.83 0.81 0.82 53
1 0.74 0.76 0.75 37
accuracy 0.79 90
macro avg 0.78 0.78 0.78 90
weighted avg 0.79 0.79 0.79 90
0.7888888888888889
cfm_xgbc2 = confusion_matrix(y_test,xgbc2_pred)
cfmxgbc_df = pd.DataFrame(cfm_xgbc2, columns=np.unique(y_test), index = np.unique(y_test))
cfmxgbc_df.index.name = 'Actual'
cfmxgbc_df.columns.name = 'Predicted'
plt.figure(figsize = (6,4))
sns.heatmap(cfmxgbc_df, cmap="YlGnBu", annot=True,annot_kws={"size": 16})
plt.show()
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
GaussianNB()
gnb_pred = gnb.predict(X_test)
print(accuracy_score(y_test, gnb_pred))
0.7
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
DecisionTreeClassifier()
dtc_pred = dtc.predict(X_test)
print(confusion_matrix(y_test,dtc_pred))
print(classification_report(y_test,dtc_pred))
print(accuracy_score(y_test, dtc_pred))
[[43 10]
[13 24]]
precision recall f1-score support
0 0.77 0.81 0.79 53
1 0.71 0.65 0.68 37
accuracy 0.74 90
macro avg 0.74 0.73 0.73 90
weighted avg 0.74 0.74 0.74 90
0.7444444444444445
tn, fp, fn, tp = confusion_matrix(y_test, dtc_pred).ravel()
specificity = tn / (tn+fp)
sensitivity = tp / (tp+fn)
specificity
0.8113207547169812
sensitivity
0.6486486486486487
cfm_dtc = confusion_matrix(y_test,dtc_pred)
cfmdtc_df = pd.DataFrame(cfm_dtc, columns=np.unique(y_test), index = np.unique(y_test))
cfmdtc_df.index.name = 'Actual'
cfmdtc_df.columns.name = 'Predicted'
plt.figure(figsize = (6,4))
sns.heatmap(cfmdtc_df, cmap="YlGnBu", annot=True,annot_kws={"size": 16})
plt.show()
from sklearn.metrics import roc_curve
# predict probabilities
pred_prob1 = rfc_gs.predict_proba(X_test)
pred_prob2 = vcs.predict_proba(X_test)
pred_prob3 = knn.predict_proba(X_test)
# roc curve for models
fpr1, tpr1, thresh1 = roc_curve(y_test, pred_prob1[:,1], pos_label=1)
fpr2, tpr2, thresh2 = roc_curve(y_test, pred_prob2[:,1], pos_label=1)
fpr3, tpr3, thresh3 = roc_curve(y_test, pred_prob3[:,1], pos_label=1)
random_probs = [0 for i in range(len(y_test))]
p_fpr, p_tpr, _ = roc_curve(y_test, random_probs, pos_label=1)
# auc scores
auc_score1 = roc_auc_score(y_test, pred_prob1[:,1])
auc_score2 = roc_auc_score(y_test, pred_prob2[:,1])
auc_score3 = roc_auc_score(y_test, pred_prob3[:,1])
print(auc_score1, auc_score2, auc_score3)
0.8748087710351862 0.8788883222845487 0.8378378378378378
plt.plot(fpr1, tpr1, linestyle='--',color='orange', label='Random Forest Classifier')
plt.plot(fpr2, tpr2, linestyle='--',color='green', label='Voting Classifier')
plt.plot(fpr3, tpr3, linestyle='--',color='red', label='KNN Classifier')
plt.plot(p_fpr, p_tpr, linestyle='--', color='blue')
# title
plt.title('ROC curve')
# x label
plt.xlabel('False Positive Rate')
# y label
plt.ylabel('True Positive rate')
plt.legend(loc='best')
plt.savefig('ROC',dpi=300)
plt.show()
We changed the age data type from float to integer.
We mapped the follow up period in the 'time' feature to represent one month instead of days.
We created two new dataframes that represented patients who died and also those survived heart failures. Then from these dataframes, we conducted a series of analysis including categorizing the survival rate into month range in order to understand the descrepancies between the survivors and the deceased.
We used Random Forest Regressor, Random Forest Classifier, XGBoost and SHAP classifier to find important features. Based on the result of the feature importance, we have identified that the most common top features are serum_creatinine, age, ejection_fraction and creatinine_phosphokinase.
We compared the selected features in order to identify the descrepancies of prediction accuracy. In this case, we experimented with three different groups of features:
a) serum_creatinine, ejection_fraction
b) serum_creatinine, age, ejection_fraction
c) serum_creatinine, age, ejection_fraction, creatinine_phosphokinase
Voting Classifier with soft voting ensemble performs the best amongst all other classifiers in the selected list with an F1-score of 0.83. Random Forest Classifier's accuracy score on the other hand is at 0.80. In the medical use cases, recall is a much more important metric than the accuracy score because having to predict better actual positives ensures that we did not let any positive cases go undetected.
In this case, the recall result for Voting Classifier is 0.89 and 0.76 for non-death and death events respectively, as compared as to Random Forest classifier with 0.89 and 0.68 respectively.
As for specificity measure, both Voting Classifier and Random Forest Classifier have the same result with 0.89. Both of the classifiers also have the same sensitivity score of 0.68.
However, we would like to point out that the dataset is an imbalanced dataset, so we would like to conduct further investigation in this.